Recent advances in single cell RNA-seq (scRNA-seq) have enabled an unprecedented level of granularity in characterizing gene expression changes in disease models. Multiple single cell analysis methodologies have been developed to detect gene expression changes and to cluster cells by similarity of gene expression. However, the classification of clusters by cell type relies heavily on known marker genes, and the annotation of clusters is performed manually. This strategy suffers from subjectivity and limits adequate differentiation of closely related cell subsets. Here, we present SingleR, a novel computational method for unbiased cell type recognition of scRNA-seq. SingleR leverages reference transcriptomic datasets of pure cell types to infer the cell of origin of each of the single cells independently. SingleR’s annotations combined with Seurat, a processing and analysis package designed for scRNA-seq, provide a powerful tool for the investigation of scRNA-seq data. We developed an R package to generate annotated scRNA-seq objects that can then use the SingleR web tool for visualization and further analysis of the data – http://comphealth.ucsf.edu/SingleR.
Here we explain in details the SingleR pipeline and present examples of applying SingleR on publicly available mouse and human scRNA-seq datasets.
Reference set: A comprehensive transcriptomic dataset (microarray or RNA-seq) of pure cell types, preferably with multiple samples per cell type.
Mouse: We processed and annotated two reference mouse datasets:
Immunological Genome Project (ImmGen): a collection of 830 microarray samples, which we classified to classified to 20 main cell types and further annotated to 253 subtypes.
A dataset of 358 mouse RNA-seq samples annotated to 28 cell types. This dataset was collected, processed and shared, courtesy of Bérénice Benayoun.
Human For human datasets we use the following reference datasets:
Human Primary Cell Atlas (HPCA): a collection of Gene Expression Omnibus (GEO) datasets), which contains 713 microarray samples classified to 38 main cell types and further annotated to 169 subtypes.
Blueprint+Encode: Blueprint Epigenomics, 144 RNA-seq pure immune samples annotated to 28 cell types. Encode: 115 RNA-seq pure stroma and immune samples annotated 17 cell types. Altogether, 259 samples with to 43 cell types.
For specific applications, smaller datasets can be applicable (see the main paper for the macrophages example).
Single-cell set: Single-cell RNA-seq dataset. It is a good practice to filter-out cells with non-sufficient genes identified and genes with non-sufficient expression across cells. In all examples below we filtered-out cells containing less than 500 genes, and genes which were found in at least one sample. However, we did not observe a significant decrease in performance when considering less stringent threshold on the number of genes per cell.
Annotation: SingleR runs in two modes: (1) Single-cell: the annotation is perforemd for each single-cell independently. (2) Cluster: the annotation is perforemd on predefined clusters, where the expression of a cluster is the sum expression of all cells in the cluster. In addition, SingleR annotates cells/clusters to all cell types, or combines cell types to major cell types.
load ('~/Documents/SingleR/SingleR/data/GSE74923.RData')
load ('~/Documents/SingleR/SingleR/data/references/Immgen.RData')
SingleR.DrawScatter(singler$seurat@data,10,immgen,232)
+ Variable genes: SingleR supports to modes for choosing the variable genes in the reference dataset. 'sd' - genes with a standard deviation across all samples in the reference dataset over a threshold. We choose thresholds such that we start with 3000-4000 genes. 'de' - top N genes that have a median expression in a cell type compared to each other cell type. We use a varying N, depending on the number of cell types used in the analysis. More details can be found in the SingleR code.
out = SingleR.DrawBoxPlot(singler$seurat@data,10,ref=immgen,main_types = T,
labels.use=c('B cells','T cells','DC','Macrophages','Monocytes','NK cells',
'Mast cells','Neutrophils','Fibroblasts','Endothelial cells'))
## [1] "Number of genes with SD>0.9: 2307"
## [1] "Number of cells: 2"
print(out$plot)
The open source code is available in the SingleR Github repository.
We provide a wrapper function (SingleR.CreateFromCounts) which creates a singler object which contains a Seurat object and singler annotations with multiple references. Click on ‘code’ to see an example.
res = SingleR.CreateFromCounts(counts='GSE74923_series_matrix.txt',
annot='GSE74923_types.txt',project.name='GSE74923',min.genes=500,
min.cells=2,regress.out='nUMI',technology='C1',species='Mouse',
citation='Kimmerling et al. 2016',reduce.file.size = T,
normalize.gene.length = T,working.dir=NULL)
After reading the files with the counts and the annotations, this function creates a Seurat object using the wrapper function SingleR.CreateSeurat. Next, if the scRNA-seq data is full-length the counts are normalized to gene length (TPM). Then, the SingleR.CreateObject is called for each reference data set. Finally, calculateSignatures is called to create a reference datasets.
The resulting object is a list with the following fields:
seurat - the seurat object. Using the reduce.file.size switch will remove the raw data from the object.
In the examples below we use the Seurat package to process the scRNA-seq data and perform the t-SNE analysis. All visualizations are readily available through the SingleR web tool – http://comphealth.ucsf.edu/SingleR. The web app allows to view the data and interactively analyze it.
A dataset that was created to test the C1 platform. 194 single-cell mouse cell lines were analyzed using C1: 86 L1210 cells, mouse lymphocytic leukemia cells, and 108 mouse CD8+ T-cells.
First, we look at the t-SNE plot colored by the original identites:
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,do.letters=F,
labels=singler$seurat@meta.data$orig.ident,label.size = 4,
dot.size = 2)
out$p
We can then observe the classification by a heatmap of the aggregated scores. These scores are before fine-tuning. We can view this heatmap by the main cell types:
SingleR.DrawHeatmap(singler$singler[[1]]$SingleR.single.main,top.n=Inf,
clusters = singler$seurat@meta.data$orig.ident)
or by all cell types (showing the top 50 cell types):
SingleR.DrawHeatmap(singler$singler[[1]]$SingleR.single,top.n=50,
clusters = singler$seurat@meta.data$orig.ident)
We can see that the L1210 cells were classified strongly to 3 types of B-cells progenitors. We can see that the CD8 cells were mostly correlated with a specific activation of effector CD8+ T-cells. Interestingly, there is one cell that seems to be correlated with both pro B-cells and CD8+ T-cells, suggesting that this a doublet.
Another interesting application of this heatmap is the ability to cluster the cells, not by their gene expression profile, but by their similarity to all cell types in the database. We use this clustering technique in the manuscript (Figure 2b).
Note: this view maybe misleading, since each column is normalized to a score between 0 and 1 (which also depends on the cell types included in the view). Thus, a single cell may have low correlations with multiple cell types, and none of them are the accurate cell type, but in the heatmap they will all be red. Without normalization the data looks like this:
SingleR.DrawHeatmap(singler$singler[[1]]$SingleR.single,top.n=50,
normalize=F,clusters = singler$seurat@meta.data$orig.ident)
Next, we can use the fine-tuned labels to color the t-SNE plot:
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =F,labels=singler$singler[[1]]$SingleR.single$labels,
label.size = 4, dot.size = 2)
out$p
We can see that SingleR correctly annotated all the L1210 as types of B-cells, almost exclusively as B-cells progenitors. On the other hand, all CD8 cells were correctly annotated to CD8+ T-cells. It is important to remember that there are 253 types that SingleR could have chosen from, but correctly chose the most relevant cell types. Interestingly, the tSNE plot incorrectly positioned cells in the wrong cluster, but SingleR was not affected by this.
Finally, we can also view the labelings as a table compared to the original identities:
kable(table(singler$singler[[1]]$SingleR.single$labels,singler$seurat@meta.data$orig.ident))
| CD8 | L1210 | |
|---|---|---|
| B cells (B.Fo) | 0 | 1 |
| B cells (B.T2) | 0 | 1 |
| B cells (preB.FrC) | 0 | 66 |
| B cells (proB.CLP) | 0 | 1 |
| B cells (proB.FrBC) | 0 | 17 |
| T cells (T.8EFF.OT1.48HR.LISOVA) | 98 | 0 |
| T cells (T.CD8.48H) | 5 | 0 |
In the main manuscript we described our analysis to a dataset produced by Hashimshony et al. (2016), which contains scRNA-seq of fibroblasts and bone-marrow derived dendritic cells (BMDCs). We showed that 32 of those 48 cells were actually macrophages, in accordance with a Helft et al. (2015). Here we reanalyzed a seminal study in scRNA-seq, which performed one of the earliest big scale analysis of single-cells. Using the Smart-Seq protocol the authors analyzed 1775 single mouse BMDCs. The paper described in details different dendritic cells (DCs) activation states that have been observed in the data, without any mention of macrophages.
Running Seurat with the same filters as described above produced the following t-SNE plot:
load ('~/Documents/SingleR/SingleR/data/GSE48968.RData')
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =T,labels=singler$seurat@meta.data$orig.ident,
dot.size = 2)
out$p
And the SingleR annotations:
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =T,
dot.size = 2)
out$p
To make it simpler to view, we present only main cell types:
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single.main,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =T,
dot.size = 2)
out$p
SingleR mapped all the left side to macrophages, top top right to monocytes, and only the bottom right to dendritic cells.
In a table:
kable(table(singler$seurat@meta.data$orig.ident,singler$singler[[1]]$SingleR.single.main$labels))
| B cells | DC | Macrophages | Monocytes | Neutrophils | |
|---|---|---|---|---|---|
| BMDC (IFN-B stimulation) | 0 | 14 | 69 | 0 | 0 |
| BMDC (LPS stimulation) | 0 | 270 | 567 | 17 | 2 |
| BMDC (PAM stimulation) | 0 | 82 | 237 | 26 | 0 |
| BMDC (PIC stimulation) | 1 | 143 | 216 | 9 | 0 |
| BMDC (Unstimulation) | 0 | 374 | 279 | 5 | 0 |
As in the main manuscript, we employed the dataset produced by Helft et al. (2015) (downloaded from GEO accession: GSE62361), which analyzed by microarray the expression profiles of GM-DCs and GM-Macs. We select the top 50 diffentrially expressed genes of each cell type, and present a heatmap of their expression in the single cells:
gse62631.de <- read.table('~/Documents/singler/GSE62361_DE.txt',
header=TRUE, sep="\t", row.names=1, as.is=TRUE)
bmdc.genes = gse62631.de[intersect(rownames(gse62631.de),
rownames(singler$seurat@data)),'Group',drop=F]
d = t(scale(t(as.matrix(singler$seurat@data[rownames(bmdc.genes),]))))
d[d>2]=2;d[d< -2]=-2
annotation_col = data.frame(Annotation=singler$singler[[1]]$SingleR.single.main$labels)
pheatmap(d[order(bmdc.genes$Group),
order(singler$singler[[1]]$SingleR.single.main$labels)],
cluster_cols = F,cluster_rows = F,clustering_method='ward.D',
border_color = NA,
annotation_col=annotation_col,
annotation_row = bmdc.genes,show_colnames=F,show_rownames=F)
We can see that SingleR correctly identified the DCs and the macrophages populations. Interstingly, the cells annotated as monocytes seem to contain markers of both GM-Macs and GM-DCs, suggesting that those are earlier progenitors, which have not yet committed to the DC or Macrophage lineage.
In accordance with Helft et al. this analysis urges the reevaluation of studies to characterize DCs based on BMDC. It also emphasize how using SingleR can assist in resolving the cellular heterogeneity.
We also applied SingleR to human datasets. Using the 10X platform, Zheng et al. produced >100K single cells from sorted immune and cell lines populations. We obtained this data from https://support.10xgenomics.com/single-cell-gene-expression/datasets, and processed it with the SingleR pipeline. To reduce computation times and to make the analyses simpler, we randomly selected 200 cells with >500 non-zero genes from 10 immune populations.
load ('~/Documents/SingleR/SingleR/data/10x (Zheng) - 2000cells.RData')
out = SingleR.PlotTsne(singler$singler[[1]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =F,labels=singler$seurat@meta.data$orig.ident,
dot.size = 2)
out$p
The tSNE plots allows to distinguish most cell types, but the T-cells subsets are all blurred together.
We first look at the Seurat clustering:
out = SingleR.PlotTsne(singler$singler[[2]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=T,
do.letters =F,labels=singler$seurat@ident,
dot.size = 2,label.size=4)
out$p
kable(table(singler$seurat@meta.data$orig.ident,singler$seurat@ident))
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| B-cells | 0 | 0 | 0 | 200 | 0 | 0 | 0 | 0 | 0 | 0 |
| CD4+ Th | 96 | 101 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| CD4+ Tm | 167 | 24 | 6 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| CD4+ Tn | 1 | 192 | 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| CD8+ Tcyto | 7 | 9 | 104 | 0 | 6 | 0 | 0 | 74 | 0 | 0 |
| CD8+ Tn | 1 | 21 | 177 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| HSC | 1 | 0 | 0 | 6 | 0 | 20 | 104 | 1 | 67 | 1 |
| Monocytes | 6 | 2 | 1 | 4 | 0 | 149 | 0 | 0 | 4 | 34 |
| NK cells | 0 | 0 | 0 | 0 | 199 | 0 | 0 | 1 | 0 | 0 |
| Tregs | 162 | 38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We can see that the performs relatively well; however, regulatory T-cells are completely dissolved in the memory T-cells cluster.
SingleR using Blueprint+ENCODE (BE) as reference produced the following annotations before fine-tuning:
SingleR.DrawHeatmap(singler$singler[[2]]$SingleR.single,top.n=Inf,
clusters = singler$seurat@meta.data$orig.ident)
We can see that before fine-tuning, there is strong blurring between T-cells states, which cannot be distinguished.
However, with fine-tuning we obtain the following annotations:
out = SingleR.PlotTsne(singler$singler[[2]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=FALSE,
do.letters =F,labels=singler$singler[[2]]$SingleR.single$labels,
dot.size = 2)
out$p
kable(table(singler$seurat@meta.data$orig.ident,
singler$singler[[2]]$SingleR.single$labels))
| CD4+ T-cells | CD4+ Tcm | CD4+ Tem | CD8+ T-cells | CD8+ Tcm | CD8+ Tem | Class-switched memory B-cells | CLP | CMP | DC | GMP | HSC | Memory B-cells | MEP | Monocytes | MPP | naive B-cells | NK cells | Plasma cells | Tregs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B-cells | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 0 | 0 | 0 | 0 | 0 | 97 | 0 | 0 | 0 | 57 | 0 | 0 | 0 |
| CD4+ Th | 23 | 82 | 57 | 1 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26 |
| CD4+ Tm | 1 | 46 | 97 | 0 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46 |
| CD4+ Tn | 68 | 108 | 7 | 1 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 9 |
| CD8+ Tcyto | 15 | 14 | 12 | 38 | 105 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| CD8+ Tn | 22 | 11 | 3 | 91 | 72 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| HSC | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 39 | 14 | 0 | 29 | 7 | 4 | 62 | 25 | 19 | 0 | 0 | 0 | 0 |
| Monocytes | 0 | 0 | 6 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 184 | 0 | 0 | 1 | 4 | 0 |
| NK cells | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 200 | 0 | 0 |
| Tregs | 0 | 47 | 59 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90 |
By observing the colors we can see that the CD4+ T-cell cluster can rouglhy be divided in to 4 states, from naive CD4+ T-cells on the bottom (green), to central memory and effector memory CD4+ T-cells in the middle (purple and orange) and Tregs in the top (pink), in accordance with the original identites.
While it is not perfect, it provides us a much more granular view of the cell states without the need to go over many markers that might not be in the data at all, and their interpretation is sometimes confusing:
df = data.frame(x=singler$seurat@dr$tsne@cell.embeddings[,1],
y=singler$seurat@dr$tsne@cell.embeddings[,2],
t(as.matrix(singler$seurat@data[c('CD3E','CD4','CD8A',
'CCR7','GZMA','GNLY','MS4A1','CD14','CD34'),])))
df = melt(df,id.vars = c('x','y'))
ggplot(df,aes(x=x,y=y,color=value)) + geom_point(size=0.3)+scale_color_gradient(low="gray", high="blue") + facet_wrap(~variable,ncol=3) +theme_classic()+xlab('')+ylab('')
Interstingly, SingleR suggests a more granular view of the B-cell cluster, splitting it to naive and memory B-cells.
Finally, we compared our method to two other classification methods. Kang et al., Nature Biotechnology (2017) used sets of markers learned from scRNA-seq PBMCs (from Zheng et al.) to correlate each single cell:
out = SingleR.PlotTsne(singler$singler[[2]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=F,
do.letters =F,labels=singler$other[,'Kang'],
dot.size = 2)
out$p
We can see that this methods has limited usability.
A bulk reference-based method by Li et. al, Nature Genetics, (2017) used a reference-based approach (but without fine-tuning).
out = SingleR.PlotTsne(singler$singler[[2]]$SingleR.single,
singler$seurat@dr$tsne@cell.embeddings,do.label=F,
do.letters =F,labels=singler$other[,'RCA'],
dot.size = 2,font.size=5)
out$p
Here we can see that the microrray reference is not able to distinguish CD4+ and CD8+ T-cells, and without fine-tuning results it does not provide a sufficient solution for annotation.
SingleR also allows to detect rare events. For example, lets take a deeper look at the sorted Monocytes:
monocytes = SingleR.Subset(singler,singler$seurat@meta.data$orig.ident=='Monocytes')
out = SingleR.PlotTsne(monocytes$singler[[2]]$SingleR.single,
monocytes$seurat@dr$tsne@cell.embeddings,do.label=F,
do.letters =F, dot.size = 2)
out$p
We can see that the t-SNE plot already suggests that there are 16 cells that are not part of the main cluster (which means that the sorting purity was ~92%). SingleR detected those cells to be plasma cells (4 cells), T-cells (8 cells), 2 progenitors, 1 DC and 1 NK cell. Is SingleR correct?
SingleR.DrawHeatmap(monocytes$singler[[2]]$SingleR.single,top.n=20,clusters=monocytes$singler[[2]]$SingleR.single$labels)
We can see that SingleR is quite convinced in its calls, giving low monocytes scores to those cells. Using markers for rare cell types (at least in the monocytes sorted cells) is problematic, since marker-based analysis is focused on clusters and not on single cells.
Interstingly, as in the t-SNE plot we see two distinct monocytes clusters, showing the ability to use SingleR for clustering.
The SingleR web tool contains >35 publicly available scRNA-seq. All data has been preprocessed with the tools described above and the web tools allows the user immediate access to analyze the data and perform further investigations on published single cell data. In addition, we invite non-tech savvy users to upload their own scRNA-seq data which will be analyzed on our servers and will process a SignleR object that can then be uploaded and used on the website (privately, only the user with the object will is able to view it). Please visit http://comphealth.ucsf.edu/SingleR for more information.
Thanks,
The SingleR team
Hashimshony, Tamar, Naftalie Senderovich, Gal Avital, Agnes Klochendler, Yaron de Leeuw, Leon Anavy, Dave Gennert, et al. 2016. “CEL-Seq2: sensitive highly-multiplexed single-cell RNA-Seq.” Genome Biology 17 (1). BioMed Central: 77. doi:10.1186/s13059-016-0938-8.
Helft, Julie, Jan Böttcher, Probir Chakravarty, Santiago Zelenay, Jatta Huotari, Barbara U. Schraml, Delphine Goubau, and Caetano Reis e Sousa. 2015. “GM-CSF Mouse Bone Marrow Cultures Comprise a Heterogeneous Population of CD11c+MHCII+ Macrophages and Dendritic Cells.” Immunity 42 (6): 1197–1211. doi:10.1016/j.immuni.2015.05.018.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2017. “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.” Nature Biotechnology 36 (1). Nature Publishing Group: 89–94. doi:10.1038/nbt.4042.
Kimmerling, Robert J., Gregory Lee Szeto, Jennifer W. Li, Alex S. Genshaft, Samuel W. Kazer, Kristofor R. Payer, Jacob de Riba Borrajo, et al. 2016. “A microfluidic platform enabling single-cell RNA-seq of multigenerational lineages.” Nature Communications 7 (January). Nature Publishing Group: 10220. doi:10.1038/ncomms10220.
Li, Huipeng, Elise T Courtois, Debarka Sengupta, Yuliana Tan, Kok Hao Chen, Jolene Jie Lin Goh, Say Li Kong, et al. 2017. “Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors.” Nature Genetics 49 (5): 708–18. doi:10.1038/ng.3818.
Shalek, Alex K., Rahul Satija, Joe Shuga, John J. Trombetta, Dave Gennert, Diana Lu, Peilin Chen, et al. 2014. “Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.” Nature 510 (7505). Nature Publishing Group: 363–69. doi:10.1038/nature13437.
Zheng, Chunhong, Liangtao Zheng, Jae-Kwang Yoo, Huahu Guo, Yuanyuan Zhang, Xinyi Guo, Boxi Kang, et al. 2017. “Landscape of Infiltrating T Cells in Liver Cancer Revealed by Single-Cell Sequencing.” Cell 169 (7): 1342–1356.e16. doi:10.1016/j.cell.2017.05.035.